Goto

Collaborating Authors

 clustering problem


Unsupervised Feature Selection for the k -means Clustering Problem

Neural Information Processing Systems

We present a novel feature selection algorithm for the k -means clustering problem. Our algorithm is randomized and, assuming an accuracy parameter \epsilon \in (0,1), selects and appropriately rescales in an unsupervised manner \Theta(k \log(k / \epsilon) / \epsilon 2) features from a dataset of arbitrary dimensions. We prove that, if we run any \gamma -approximate k -means algorithm ( \gamma \geq 1) on the features selected using our method, we can find a (1 (1 \epsilon)\gamma) -approximate partition with high probability.


Feature Selection in Clustering Problems

Neural Information Processing Systems

A novel approach to combining clustering and feature selection is pre- sented. It implements a wrapper strategy for feature selection, in the sense that the features are directly selected by optimizing the discrimina- tive power of the used partitioning algorithm. On the technical side, we present an efficient optimization algorithm with guaranteed local con- vergence property. The only free parameter of this method is selected by a resampling-based stability analysis. Experiments with real-world datasets demonstrate that our method is able to infer both meaningful partitions and meaningful subsets of features.


Distributional Individual Fairness in Clustering

arXiv.org Machine Learning

In this paper, we initiate the study of fair clustering that ensures distributional similarity among similar individuals. In response to improving fairness in machine learning, recent papers have investigated fairness in clustering algorithms and have focused on the paradigm of statistical parity/group fairness. These efforts attempt to minimize bias against some protected groups in the population. However, to the best of our knowledge, the alternative viewpoint of individual fairness, introduced by Dwork et al. (ITCS 2012) in the context of classification, has not been considered for clustering so far. Similar to Dwork et al., we adopt the individual fairness notion which mandates that similar individuals should be treated similarly for clustering problems. We use the notion of $f$-divergence as a measure of statistical similarity that significantly generalizes the ones used by Dwork et al. We introduce a framework for assigning individuals, embedded in a metric space, to probability distributions over a bounded number of cluster centers. The objective is to ensure (a) low cost of clustering in expectation and (b) individuals that are close to each other in a given fairness space are mapped to statistically similar distributions. We provide an algorithm for clustering with $p$-norm objective ($k$-center, $k$-means are special cases) and individual fairness constraints with provable approximation guarantee. We extend this framework to include both group fairness and individual fairness inside the protected groups. Finally, we observe conditions under which individual fairness implies group fairness. We present extensive experimental evidence that justifies the effectiveness of our approach.


Ising-based Consensus Clustering on Specialized Hardware

arXiv.org Machine Learning

The emergence of specialized optimization hardware such as CMOS annealers and adiabatic quantum computers carries the promise of solving hard combinatorial optimization problems more efficiently in hardware. Recent work has focused on formulating different combinatorial optimization problems as Ising models, the core mathematical abstraction used by a large number of these hardware platforms, and evaluating the performance of these models when solved on specialized hardware. An interesting area of application is data mining, where combinatorial optimization problems underlie many core tasks. In this work, we focus on consensus clustering (clustering aggregation), an important combinatorial problem that has received much attention over the last two decades. We present two Ising models for consensus clustering and evaluate them using the Fujitsu Digital Annealer, a quantum-inspired CMOS annealer. Our empirical evaluation shows that our approach outperforms existing techniques and is a promising direction for future research.


A Unified Framework for Tuning Hyperparameters in Clustering Problems

#artificialintelligence

Selecting hyperparameters for unsupervised learning problems is difficult in general due to the lack of ground truth for validation. However, this issue is prevalent in machine learning, especially in clustering problems with examples including the Lagrange multipliers of penalty terms in semidefinite programming (SDP) relaxations and the bandwidths used for constructing kernel similarity matrices for Spectral Clustering. Despite this, there are not many provable algorithms for tuning these hyperparameters. In this paper, we provide a unified framework with provable guarantees for the above class of problems. We demonstrate our method on two distinct models.


Authorship Analysis as a Text Classification or Clustering Problem

#artificialintelligence

Many such'literary' quandaries are inspected by expert linguists as analysing and categorising discourses is fairly complex, domain-specific and highly multi-dimensional. One of latest research areas in Natural Language Processing is Authorship Analysis which is trying to leverage the computational power of big-data and artificial intelligence combined with linguistics and cognitive psychology to encode automatic classification of texts, identification of author profiles and resolution of authorship conflicts. This article is an attempt to introduce the concept of authorship analysis, its application areas and the major sub-tasks associated with it. The art and science of discriminating between writing styles of authors by identifying the characteristics of the persona of the authors and examining articles authored by them is called Authorship Analysis. Consequentially, it also aims to determine biographic characteristics of an individual like age, gender, native language and cognitive psychological traits based on "available information" pertaining to that individual. In this article, "available information" refers to textual data only in the context of authorship analysis, however, information in this context could go beyond textual format as it might also involve usage of multi-modal observations.